1 Statement of Problems Studied

Under ARO grant DAAG55-98-1-0341, for the period June 1998 through May 2002, the
principal investigator (PI) conducted statistical methodology research on the Minimum
Description Length (MDL) principle and its applications, on microarray image compression
and data analysis, and on classification based on hyperspectral measurements in remote
sensing.


2  Summary  of  the  Most  Important  Results 

I. Tree recognition based on hyperspectral measurements

Hyperspectral data consist of intensity readings in hundreds of bands from ground-based or
airborne spectrometers, whereas multispectral data typically have 4-7 bands. When plotted
against wavelength, a hyperspectral measurement gives a smooth-looking curve.

I.1 Conifer Tree Species Recognition

Using a high spectral resolution spectrometer, the PSD1000 (ANCAL, 1995), measurements
were taken at the Blodgett Forest Research Station of the University of California, Berkeley,
located on the western slope of the central Sierra Nevada, El Dorado County, California.
There are 322 measurements on 6 conifer species in equal proportions: Sugar-pine
(SP, Pinus lambertiana), Ponderosa-pine (PP, Pinus ponderosa), White-fir (WF, Abies concolor),
Douglas-fir (DF, Pseudotsuga menziesii), Incense-cedar (IC, Calocedrus decurrens),
and Giant-sequoia (GS, Sequoiadendron giganteum). After pre-processing, each measurement
consists of 179 bands from 350 nm to 900 nm.

Yu et al. (1999) use hyperspectral measurements collected in the Sierra Nevada Mountains
in California to discriminate six species of conifer trees using a recent non-parametric
statistical technique known as Penalized Discriminant Analysis (PDA). A classification
accuracy of 76% is obtained. The emphasis is on providing an intuitive, geometric
description of PDA that makes the advantages of penalization clear. PDA is a penalized
version of Fisher’s Linear Discriminant Analysis (LDA) that can greatly improve upon LDA
when there are many highly correlated variables. This discriminative power of hyperspectral
data for conifer tree species recognition opens up the possibility of automatic species
identification (in contrast to human-involved photo interpretation) based on hyperspectral
data for forestry management at a large scale. However, most airborne spectrometers have
only several bands operating at a time, instead of the hundreds used in the above work.
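
As a rough illustration of the penalization idea, the sketch below replaces the within-class covariance of LDA with a version augmented by a roughness penalty across neighbouring bands, then classifies by the nearest class mean in that penalized metric. It is a minimal sketch under stated assumptions and not the procedure of Yu et al. (1999): the second-difference penalty, the value of `lam`, and the helper `simulate_spectra` (which generates synthetic smooth curves in place of the Blodgett Forest data) are all hypothetical choices made for the example.

```python
import numpy as np


def second_difference_penalty(n_bands):
    """Roughness penalty Omega = D'D, where D takes second differences across bands."""
    D = np.diff(np.eye(n_bands), n=2, axis=0)  # shape (n_bands - 2, n_bands)
    return D.T @ D


def fit_pda(X, y, lam):
    """Estimate class means and the penalized within-class covariance."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    sigma_w = centered.T @ centered / (len(y) - len(classes))
    # Assumption: a simple second-difference penalty; lam controls its weight.
    sigma_pen = sigma_w + lam * second_difference_penalty(X.shape[1])
    return classes, means, sigma_pen


def predict_pda(model, X):
    """Assign each spectrum to the class mean nearest in the penalized metric."""
    classes, means, sigma_pen = model
    precision = np.linalg.inv(sigma_pen)
    dist2 = np.stack(
        [np.einsum("ij,jk,ik->i", X - m, precision, X - m) for m in means], axis=1
    )
    return classes[np.argmin(dist2, axis=1)]


def simulate_spectra(rng, n_per_class=20, n_bands=60, n_classes=3, noise_sd=0.1):
    """Hypothetical data: smooth class-specific curves plus independent band noise."""
    t = np.linspace(0.0, 1.0, n_bands)
    X, y = [], []
    for c in range(n_classes):
        mean_curve = np.sin(2 * np.pi * t) + 0.5 * c * np.cos(np.pi * t)
        X.append(mean_curve + noise_sd * rng.standard_normal((n_per_class, n_bands)))
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train, y_train = simulate_spectra(rng)
    X_test, y_test = simulate_spectra(rng)
    model = fit_pda(X_train, y_train, lam=1.0)
    print("test accuracy:", np.mean(predict_pda(model, X_test) == y_test))
```

With 60 training spectra and 60 bands, the unpenalized within-class covariance in this example is singular, so plain LDA cannot invert it; adding the penalty term is what makes the metric usable.
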

On-going research with M. Hansen, M. Ostland and P. Gong aims at using the above
hyperspectral data collected on the ground to select the most discriminative bands to be used
with airborne spectrometers. This research uses MDL and other model selection criteria.

I.2 Identification of Healthy vs. Infected Oak Trees by Sudden Oak Death (SOD)

In Spring 2001, a new project was started with the help of a new graduate student, Dave
Graham-Squire, to use hyperspectral measurements to distinguish healthy oak trees from
those infected by Sudden Oak Death, a major epidemic threatening oak trees in California
and other west coast states.

This work is again in collaboration with Prof. Gong’s group in ESPM at Berkeley. We have
so far obtained some preliminary results.

II. Minimum Description Length (MDL) Principle

The  PI  continued  her  research  program  on  MDL  in  the  funding  period  in  three  different 
directions. 

II.1 A Thorough Review of MDL

Hansen and Yu (2001) reviews the field of MDL and presses for practical applications of
MDL, such as regression variable selection. Aiming to expose MDL to more statisticians,
this paper reviews and synthesizes MDL in the context of frequentist and Bayesian
statistics, illustrating the connections with real data sets. We make the point that
MDL generalizes the maximum likelihood principle of frequentist statistics to model selection
problems and that it shares many formal derivations with Bayesian statistics. We study three
forms of MDL criteria in regression problems, one of which is a new closed-form criterion of
our own. These criteria are investigated in a genetics example, a fruit-classification example
using NRC data, and a simulation study. It is noteworthy that all three MDL criteria
automatically select the ’biologically correct’ model in the genetics example, while AIC and
BIC have to be hand tuned to do so. In general, these MDL criteria are found to behave
like either AIC or BIC, depending on which is more desirable. Hansen and I are now studying
their adaptivity and comparing frequentist and Bayesian procedures in the MDL framework.
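
To make the idea of an MDL criterion for regression variable selection concrete, the sketch below scores every subset of predictors by the simplest two-stage description length, which has the same form as BIC, and contrasts it with AIC. It is only an illustration under stated assumptions: it does not implement the new closed-form criterion mentioned above, and the function names and the simulated data in the usage block are hypothetical.

```python
import itertools

import numpy as np


def residual_sum_of_squares(X, y, subset):
    """RSS of a least-squares fit on the chosen columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def two_stage_mdl(X, y, subset):
    """Two-stage code length in nats: data cost plus (k/2) log n for k parameters."""
    n, k = len(y), len(subset) + 1
    return 0.5 * n * np.log(residual_sum_of_squares(X, y, subset) / n) + 0.5 * k * np.log(n)


def aic(X, y, subset):
    """AIC on the same scale: data cost plus one unit per parameter."""
    n, k = len(y), len(subset) + 1
    return 0.5 * n * np.log(residual_sum_of_squares(X, y, subset) / n) + k


def best_subset(X, y, criterion):
    """Exhaustive search over all subsets; only feasible for a handful of predictors."""
    p = X.shape[1]
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(p), r) for r in range(p + 1)
    )
    return min(subsets, key=lambda s: criterion(X, y, s))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 5))
    # Hypothetical truth: only predictors 0 and 2 matter.
    y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.standard_normal(200)
    print("two-stage MDL picks:", best_subset(X, y, two_stage_mdl))
    print("AIC picks:          ", best_subset(X, y, aic))
```

Because the per-parameter penalty (1/2) log n exceeds AIC's penalty of 1 once n is larger than about 8, the two-stage form never selects a larger model than AIC in an exhaustive search of this kind.
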

II.2 Simultaneous Denoising and Compression via MDL: Adaptive Wavelet Thresholding

With  the  massive  amount  of  data  available  in  almost  every  field  of  statistical  applications,  the 
compression  aspect  of  data  needs  to  be  addressed  formally  in  statistical  inference.  Wavelet 
transformation  provides  an  ideal  framework  for  such  a  first  study.  Minimum  description 
length  (MDL)  criteria  are  studied  in  Hansen  and  Yu  (2000)  for  model  selection  as  flexible 
forms  of  thresholding  for  wavelet  denoising  and  compression.  Mixture  MDL  methods  based 
on  a  single  Laplacian,  a  two-piece  Laplacian,  and  a  generalized  Gaussian  prior  are  shown  to  be 
adaptive  thresholding  rules.  The  MDL  procedures  achieve  mean  squared  errors  comparable 
with  other  popular  thresholding  schemes,  but  they  tend  to  keep  far  fewer  coefficients.  From 
this  property,  we  demonstrate  that  our  methods  represent  excellent  tools  for  simultaneous 
denoising and compression. We make this claim precise by analyzing MDL thresholding in
two optimality frameworks: one in which we measure rate and distortion based on quantized
coefficients, and one in which we do not quantize but instead record rate simply as the
number of non-zero coefficients.
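
The second framework, in which rate is recorded as the number of non-zero coefficients, can be illustrated with a small sketch. The code below applies an orthonormal Haar transform, zeroes small detail coefficients, and reports rate and distortion. Note the substitutions: the threshold used is the familiar universal threshold sigma * sqrt(2 log n), standing in for the mixture MDL rules of Hansen and Yu (2000), which are not reproduced here; the noise level is assumed known; and the test signal is hypothetical.

```python
import numpy as np


def haar_forward(x):
    """Orthonormal Haar transform of a signal whose length is a power of two.

    Returns [approximation, coarsest detail, ..., finest detail] as one vector.
    """
    approx = np.asarray(x, dtype=float)
    pieces = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        pieces.append((even - odd) / np.sqrt(2))
        approx = (even + odd) / np.sqrt(2)
    pieces.append(approx)
    return np.concatenate(pieces[::-1])


def haar_inverse(coeffs):
    """Invert haar_forward."""
    coeffs = np.asarray(coeffs, dtype=float)
    approx, pos = coeffs[:1], 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        finer = np.empty(2 * len(approx))
        finer[0::2] = (approx + detail) / np.sqrt(2)
        finer[1::2] = (approx - detail) / np.sqrt(2)
        approx = finer
    return approx


def universal_threshold(sigma, n):
    """Stand-in threshold (not an MDL rule): sigma * sqrt(2 log n)."""
    return sigma * np.sqrt(2.0 * np.log(n))


def hard_threshold(coeffs, thresh):
    """Zero every detail coefficient below thresh; keep the approximation term."""
    out = np.array(coeffs, dtype=float)
    detail = out[1:]  # view into out
    detail[np.abs(detail) < thresh] = 0.0
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    clean = np.repeat([0.0, 4.0, -2.0, 3.0], 64)  # hypothetical piecewise-constant signal
    sigma = 0.5                                   # noise level assumed known
    noisy = clean + sigma * rng.standard_normal(clean.size)
    kept = hard_threshold(haar_forward(noisy), universal_threshold(sigma, noisy.size))
    denoised = haar_inverse(kept)
    print("rate (non-zero coefficients):", np.count_nonzero(kept), "of", kept.size)
    print("distortion (MSE vs. clean):  ", np.mean((denoised - clean) ** 2))
```

Counting non-zero coefficients ignores the cost of coding their values and positions, which is why the quantized framework is treated separately in the text.
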

II.3 Other MDL Work

Pathwise  expansions  are  obtained  in  Li  and  Yu  (2000)  for  the  predictive  and  mixture  code 
lengths  used  in  MDL.  The  expansions  are  for  exponential  families  and  to  the  constant  order. 
The  results  are  useful  for  understanding  different  MDL  forms  and  provide  upper  bounds  on 
the  Kolmogorov  complexity  of  individual  strings. 
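
For orientation, a constant-order expansion of a mixture code length usually takes the form below for a smooth k-parameter family. This is the standard Laplace-approximation form, given here as an assumption about the general shape of such results rather than a transcription of the statements in Li and Yu (2000); the symbols (prior density w, maximum likelihood estimate, per-observation Fisher information I) are notation introduced for this illustration.

```latex
% Mixture code length with prior density w over a k-dimensional parameter \theta:
%   m(x^n) = \int p_\theta(x^n)\, w(\theta)\, d\theta
\[
  -\log m(x^n)
  \;=\;
  -\log p_{\hat\theta}(x^n)
  \;+\; \frac{k}{2}\log\frac{n}{2\pi}
  \;+\; \log\frac{\sqrt{\det I(\hat\theta)}}{w(\hat\theta)}
  \;+\; o(1),
\]
% where \hat\theta is the maximum likelihood estimate from x^n and
% I(\theta) is the Fisher information matrix per observation.
```

The first two terms match the two-stage (BIC-like) code length; the third is the constant-order term that distinguishes one MDL form from another.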

Rissanen and Yu (2000) is an invited vignette written for the leading statistics journal
JASA to commemorate the year 2000. It briefly explains the theoretical underpinnings of
many information technology products and advocates that the connection between statistics
and information theory is well worth exploring by statisticians.

III.  Microarray  image  compression  and  data  analysis 

A Ph.D. thesis was completed by R. Jornsten under the PI’s supervision on microarray
image compression and data analysis. It deals directly with microarray applications,
but addresses in general the problem of data compression and its implications for statistical
inference. In particular, we consider the following three questions. How can we quantify the
effect of compression on statistical inference? How should a compression scheme be designed
such that the effect of compression on inference is minimal? How can the Minimum
Description Length (MDL) principle be used for model selection with an extraordinary number
of dependent predictors? In this thesis, we attempt to answer these three questions in a
general setting, and with a specific application in the compression and analysis of microarray
images. The results from the thesis are being published in conference and journal papers as
described below.

III.1 Multiterminal Data Compression

In  Jornsten  and  Yu  (2002a),  we  present  new  results  in  the  context  of  multiterminal  data 
compression.  We  derive  an  improved  upper  bound  on  the  asymptotic  estimation  efficiency 
under  rate  constraints.  Furthermore,  we  give  a  geometric  interpretation  of  the  new  bound, 
which  provides  insights  into  the  nature  of  the  multiterminal  estimation  problem.  The  bound 
on  asymptotic  estimation  efficiency  gives  a  gold  standard,  by  which  practical  compression 
schemes  can  be  evaluated,  and  the  effect  of  compression  on  estimation  analyzed. 

III.2 Microarray Image Compression

In Jornsten et al. (2002), we present a progressive lossy and lossless compression scheme
for microarray images. Microarray image technology makes possible the simultaneous
measurement of expression levels of thousands of genes. These images have become the
standard tools to investigate fundamental biological functions such as gene regulation and
interaction, and to discover genetic pathways for diseases such as cancer. They are widely
used in academic and industrial laboratories, producing vast quantities of image data.
Our  compression  scheme  has  been  tailored  to  the  microarray  image  application,  such  that 
the  essential  statistical  information  in  the  images  is  well-preserved  at  low  bit-rates.  The 
compression  scheme  has  a  multi-level  coded  data  structure,  which  allows  for  fast  re-processing 
and  transmission  of  image  subsets. 

III.3 Simultaneous Gene Selection and Cluster Analysis via MDL

The  information  extracted  from  microarray  image  experiments  provides  statisticians  with 
formidable  data  analysis  tasks.  In  current  research,  particular  attention  has  been  given  to 
the  problems  of  gene  clustering  and  sample  classification.  Each  microarray  experiment,  or 
sample,  corresponds  to  a  type  of  tissue,  tumor  or  stage  of  development.  In  gene  clustering,  the 
goal  is  to  identify  genes  that  exhibit  similar  expression  levels  across  samples  or  experiments. 
In  sample  classification,  a  collection  of  gene  expressions  is  used  to  build  a  predictive  model  for 
the sample type. In Jornsten and Yu (2002b), we present a new MDL (Minimum Description
Length) model selection criterion for the simultaneous clustering of genes and selection of
subsets of gene clusters that function as sample class predictors. For the first time, an MDL
selection criterion is given for both predictor variables (genes) and response variables (sample
class labels). Using our MDL model selection criterion, we are able to build parsimonious
classifiers that perform as well as, or better than, the best methods reported in the
literature. Our MDL model selection criterion is generally applicable to prediction problems
with highly correlated predictors, where the number of predictors significantly exceeds the
number of samples.